Inference tutorial - Part 3 of e2e series [WIP] #2343
Conversation
docs/source/inference.rst (Outdated)

    vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3

    Inference with vLLM
should we move this after Inference with Transformers?
cc @jainapurva I think if vLLM is our recommended serving solution, this should go before transformers.
docs/source/inference.rst (Outdated)

    vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements.

    Setting up vLLM with Quantized Models
nit: this doesn't have to be a new section I think
Hi @jainapurva, by the way I'm adding a [image attachment]
Force-pushed from b93b892 to ce675b8
docs/source/inference.rst (Outdated)

    .. note::
       For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.

    Inference with vLLM
for this section, can you replace it with https://huggingface.co/pytorch/Qwen3-8B-int4wo-hqq#inference-with-vllm? It might be easier to do command line compared to code.
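For reference, a rough sketch of what the command-line version could look like, reusing the `vllm serve` command from the diff above and the sampling parameters from the curl snippet later in this thread (the prompt text and port are assumptions, not taken from the PR):

```bash
# With the server from `vllm serve pytorch/Phi-4-mini-instruct-float8dq ...` running,
# query its OpenAI-compatible endpoint (default port 8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "pytorch/Phi-4-mini-instruct-float8dq",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 32768
      }'
```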
docs/source/serving.rst (Outdated)

    print(f"Output: {generated_text!r}")
    print("-" * 60)

    [Optional] Inference with Transformers
We should have an Inference w/ SGLang section
I tested the integration of TorchAO and SGLang, came across a lot of issues in running the server. As discussed with @jerryzh168 offline, we can add this later, after more thorough testing and updates.
Should we at least add an SGLang section and say (Coming soon!) or something? It's in the diagram at the top right now, so people may search for it.
Looks great! Overall I feel we should add some more text in between code blocks so it feels more like a tutorial, and remove some duplicate code, which is distracting to readers.
    quantized_model.push_to_hub(save_to, safe_serialization=False)
    tokenizer.push_to_hub(save_to)

    # Manual Testing
I would split this into a separate code block and add some text in between, since everything below this line is technically not part of the user flow
    output_text = tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print("Response:", output_text[0][len(prompt):])
Can you also add an example of what is printed here?
    model_id = "microsoft/Phi-4-mini-instruct"

    from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
nit: move this to the top like the other imports?
    pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

    .. code-block:: bash
I think we need to add some text here. E.g. we need to explain we're serving with the quantized checkpoint we pushed to HF hub above. Also would be good to clarify that "float8dq" stands for "dynamic quant"
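As one possible way to address this (a sketch of wording, not the final text): the tutorial could say that we are serving the float8 dynamic-quantized checkpoint pushed to the HF hub in the earlier step, and that the "float8dq" suffix stands for float8 dynamic quantization, e.g.:

```bash
# Serve the checkpoint we pushed to the HF hub above.
# "float8dq" in the repo name means float8 dynamic quantization (weights stored in
# float8, activations dynamically quantized to float8 at runtime); the tokenizer is
# reused from the original microsoft/Phi-4-mini-instruct repo.
vllm serve pytorch/Phi-4-mini-instruct-float8dq \
    --tokenizer microsoft/Phi-4-mini-instruct \
    -O3
```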
"top_p": 0.95, | ||
"top_k": 20, | ||
"max_tokens": 32768 | ||
}' |
I think we should also quickly summarize that serving the float8 model with vLLM is X times faster than serving the original high-precision model, e.g. from the model card (https://huggingface.co/pytorch/Phi-4-mini-instruct-float8dq#quantization-recipe): "Serve using vLLM with 36% VRAM reduction, 1.15x-1.2x speedup and little to no accuracy impact on H100."
    from transformers import (
        AutoModelForCausalLM,
        AutoProcessor,
        AutoTokenizer,
nit: formatting is off, here and other code blocks below
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    print(untied_model)
    from transformers.modeling_utils import find_tied_parameters
nit: move import to top
    Step 1: Untie Embedding Weights
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
Is this step necessary actually? I don't think I had to do any of this for Llama models for example. Can you share the source for this?
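To help answer this, a quick check could be added to the tutorial (a sketch; whether untying is needed depends on whether the base model ties lm_head to the embedding, and `find_tied_parameters` is already imported in the diff above):

```python
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import find_tied_parameters

# Load the base (unquantized) model; model id taken from the earlier diff.
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct")

# If the config ties word embeddings and find_tied_parameters reports a group
# containing the embedding and lm_head weights, they share storage and must be
# untied before quantizing them with different configs. Models that report no
# tied parameters (e.g. some Llama checkpoints) can skip the untie step.
print(model.config.tie_word_embeddings)
print(find_tied_parameters(model))
```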
    torch.cuda.reset_peak_memory_stats()

    prompt = "Hey, are you conscious? Can you talk to me?"
I see this code duplicated 3 times. Can we just have it appear in 1 place? Having it under "Evaluation" makes sense to me
    Memory Benchmarking
    ^^^^^^^^^^^^^^^^^
Should add some text here. "For the Phi-4-mini-instruct model, serving with float8 dynamic quantization used X% less memory" or something
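For reference, a minimal sketch of how such a number could be measured, building on the `torch.cuda.reset_peak_memory_stats()` call already in the diff (the `model`/`inputs` names and the generate call are placeholders, not from the PR):

```python
import torch

# Reset the peak-memory counter right before the region we want to measure.
torch.cuda.reset_peak_memory_stats()

# Placeholder for the generation step being benchmarked, e.g.:
# outputs = model.generate(**inputs, max_new_tokens=128)

# Peak GPU memory allocated during generation, in GB; run once for the
# baseline bf16 model and once for the float8dq model to compute the saving.
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak memory: {peak_gb:.2f} GB")
```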